This is part of my current paper about political reporting in German online news.

To measure the ideological content of several major online news services, I compare the topics discussed in these media with the press releases of the Bundestag parties using a structural topic model.

The following is an analysis of the content of the press releases scraped from the public websites of the political parties and their parliamentary groups (factions). Much of this analysis is inspired by the work of Julia Silge and David Robinson (Text Mining with R: A Tidy Approach).

I assume that parties use their press releases to promote their issues and positions and thereby also contribute to the election campaign. However, press releases issued by the parties differ from those issued by the factions. Parties are financed by membership dues, donations and the reimbursement of campaign expenses, while factions are financed by state funds. According to §25 (2) of the Parteiengesetz (Political Parties Act), state-funded factions may not support parties from their funds, because parties that are not represented in the Bundestag would otherwise be put at a disadvantage.

Since it is difficult to draw the line between faction activity and election campaign assistance, I assume that factions shape the public perception of their party with their press releases, which is why I include the press releases of both the federal party and the federal faction.

Load Data

CDU

FDP

B90 / Die Grünen

Clean Data

  1. Remove…
  2. Stemming
Row 760

title_text:
Uwe Witt begrüßt Vorschlag der Rentenversicherung zur steuerfinanzierten Mütterrente . Januar 2018. Die Deutsche Rentenversicherung warnt die „GroKo-Sondierer“ CDU/CSU und SPD in ne neue Regierungskoalition müsse eine Finanzierung der Mütterrente regeln. Ein Ausbau dürfe nicht zu Lasten des Beitragszahlers gehen. „Alle  Mehrausgaben, die der Rentenversicherung durch die Finanzierung zusätzlicher Mütterrenten für Geburten vor 1992 entstehen, müssen sach- und systemgerecht aus Steuermitteln finanziert werden“, heißt es in einem in der Nacht zum Dienstag verbreiteten Beschluss der Bundesvertreterversammlung.Der Bundestagsabgeordnete Uwe Witt, kommissarischer Sprecher des Arbeitskreises „Arbeit & Soziales der AfD-Bundestagsfraktion und Leiter des Bundesfachausschuss 11 (Soziale Sicherungssysteme und Rente, Arbeits- und Sozialpolitik) ist, hat das Nein der Alternative für Deutschland (AfD) zu einer Ausweitung der Mütterrente aus Beitragsmitteln erwartungsgemäß bekräftigt:„Da es sich bei den Mehrausgaben um beitragsfremde Leistungen handelt, sind diese, wie im AfD-Programm vorgesehen, aus Steuermitteln zu finanzieren.“, so Witt am Rande einer parteiinternen Veranstaltung in .Die von der CSU geforderte Ausweitung der Mütterrente soll laut Informationen der Rentenversicherung sieben Milliarden Euro kosten. 
Witt sagt: Es freut uns, dass die Deutsche Rentenversicherung uns, als der größten Oppositionspartei im Bundestag, zustimmt, dass die Ausgaben für die Erweiterung der Mütterrente aus Steuermitteln zu finanzieren sind.“

text_cleaned:
uwe witt begrüßt vorschlag rentenversicherung steuerfinanzierten mütterrente deutsche rentenversicherung warnt groko sondierer ne regierungskoalition müsse finanzierung mütterrente regeln ausbau dürfe lasten beitragszahlers   mehrausgaben rentenversicherung finanzierung zusätzlicher mütterrenten geburten entstehen sach systemgerecht steuermitteln finanziert heißt verbreiteten beschluss bundesvertreterversammlung uwe witt kommissarischer arbeitskreises arbeit soziales leiter bundesfachausschuss soziale sicherungssysteme rente arbeits sozialpolitik alternative deutschland ausweitung mütterrente beitragsmitteln erwartungsgemäß bekräftigt mehrausgaben beitragsfremde leistungen handelt programm vorgesehen steuermitteln finanzieren witt rande parteiinternen veranstaltung geforderte ausweitung mütterrente informationen rentenversicherung milliarden euro kosten witt freut deutsche rentenversicherung größten oppositionspartei bundestag zustimmt ausgaben erweiterung mütterrente steuermitteln finanzieren

Inspect Data

Tokens

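library(dplyr)     # pipes and data manipulation
library(tidyr)     # spread(), gather()
library(tidytext)  # unnest_tokens(), bind_tf_idf()
library(ggplot2)   # plotting
library(scales)    # percent_format()

# One row per word per press release
tokens <- pressReleases %>% unnest_tokens(word, text_cleaned1)

# Word counts per party with tf, idf and tf-idf (each party is one "document")
tokens.count <- tokens %>%
  count(party, word, sort = TRUE) %>%
  ungroup() %>%
  bind_tf_idf(word, party, n)

# Plot the 15 most frequent words per party
tokens.count %>% 
  arrange(desc(tf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>% 
  group_by(party) %>% 
  top_n(15, tf) %>%   # rank by tf explicitly; top_n() otherwise uses the last column (tf_idf)
  ungroup() %>%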
  ggplot(aes(word, tf, fill = party)) +
  geom_col(show.legend = FALSE, fill = "darkslategray4", alpha = 0.9) +
  labs(x = NULL, y = "Term Frequency") +
  facet_wrap(~party, ncol = 3, scales = "free") +
  coord_flip()

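Compare the word frequencies of the different parties. Each panel plots the relative frequency of a word for one party against its relative frequency for another party:

  • empty space at low frequencies indicates less similarity between the two parties.

  • the closer the words in a panel lie to the dashed diagonal line (intercept 0, slope 1), the more similar the vocabulary of the two parties.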
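To make the reading of these panels concrete, here is a minimal base R sketch with invented counts for two hypothetical parties `A` and `B` (the words and numbers are assumptions for illustration, not taken from the corpus). It computes the relative word frequencies per party and the absolute gap between them, which is the quantity mapped to the point color in the plots. A gap of 0 puts a word exactly on the reference line drawn by `geom_abline()`.

```r
# Invented word counts for two hypothetical parties (not taken from the corpus).
counts <- data.frame(
  party = c("A", "A", "A", "B", "B", "B"),
  word  = c("rente", "eu", "steuer", "rente", "eu", "steuer"),
  n     = c(50, 30, 20, 10, 30, 60)
)

# Word-by-party contingency table of the counts.
tab <- xtabs(n ~ word + party, data = counts)

# Relative frequency of each word within its party (columns sum to 1).
prop <- prop.table(tab, margin = 2)

# Absolute gap between the two parties; 0 means the word lies on the diagonal.
gap <- abs(prop[, "A"] - prop[, "B"])
print(round(gap, 2))
```

In this toy example `eu` is used equally often by both parties and would sit on the line, while `rente` and `steuer` would lie far from it on opposite sides.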

frequency <- tokens.count %>%
  group_by(party) %>%
  mutate(proportion = n/sum(n)) %>%
  select(party, word, proportion) %>%
  spread(party, proportion) 

CDU

frequency %>% 
  gather(party, proportion, -word, -CDU) %>%
  ggplot(aes(x = proportion, y = `CDU`, color = abs(`CDU` - proportion))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
  facet_wrap(~party, nrow = 2) +
  theme(legend.position="none") +
  labs(y = "CDU", x = NULL)

#ggsave("../figs/word_freq_CDU.png", width = 15, height = 10)

SPD

frequency %>% 
  gather(party, proportion, -word, -SPD) %>%
  ggplot(aes(x = proportion, y = `SPD`, color = abs(`SPD` - proportion))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
  facet_wrap(~party, nrow = 2) +
  theme(legend.position="none") +
  labs(y = "SPD", x = NULL)

#ggsave("../figs/word_freq_SPD.png", width = 15, height = 10)

FDP

frequency %>% 
  gather(party, proportion, -word, -FDP) %>%
  ggplot(aes(x = proportion, y = `FDP`, color = abs(`FDP` - proportion))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
  facet_wrap(~party, nrow = 2) +
  theme(legend.position="none") +
  labs(y = "FDP", x = NULL)

#ggsave("../figs/word_freq_FDP.png", width = 15, height = 10)

B90/GRÜNE

frequency %>% 
  gather(party, proportion, -word, -`B90/GRÜNE`) %>%
  ggplot(aes(x = proportion, y = `B90/GRÜNE`, color = abs(`B90/GRÜNE` - proportion))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
  facet_wrap(~party, nrow = 2) +
  theme(legend.position="none") +
  labs(y = "B90/GRÜNE", x = NULL)

#ggsave("../figs/word_freq_GRUENE.png", width = 15, height = 10)

DIE LINKE

frequency %>% 
  gather(party, proportion, -word, -`DIE LINKE`) %>%
  ggplot(aes(x = proportion, y = `DIE LINKE`, color = abs(`DIE LINKE` - proportion))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
  facet_wrap(~party, nrow = 2) +
  theme(legend.position="none") +
  labs(y = "DIE LINKE", x = NULL)

#ggsave("../figs/word_freq_LINKE.png", width = 15, height = 10)

AfD

frequency %>% 
  gather(party, proportion, -word, -`AfD`) %>%
  ggplot(aes(x = proportion, y = `AfD`, color = abs(`AfD` - proportion))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
  facet_wrap(~party, nrow = 2) +
  theme(legend.position="none") +
  labs(y = "AfD", x = NULL)

#ggsave("../figs/word_freq_AfD.png", width = 15, height = 10)

TF-IDF

The statistic tf-idf (term frequency - inverse document frequency) is intended to measure how important a word is to a document in a collection (or corpus) of documents. In this case we measure how important a word is to a party (within all the press releases of that party) in the collection of all parties (and their press releases).

The inverse document frequency for any given term is defined as

\[ idf\text{(term)}=\ln\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right) \]

In this case, \(n_{\text{documents}} = 6\) as we have 6 different parties.
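As a small worked example, the following base R sketch evaluates the idf for terms that occur in the press releases of one, two, and all six parties. It assumes the natural logarithm of the document ratio, which is what `bind_tf_idf()` from tidytext computes. The names in `n_containing` and the term frequency `tf = 0.005` are invented for illustration.

```r
# Number of "documents": one per party, as in the text.
n_documents <- 6

# Hypothetical terms occurring in the press releases of 1, 2 and all 6 parties.
n_containing <- c(one_party = 1, two_parties = 2, all_parties = 6)

# Natural-log idf.
idf <- log(n_documents / n_containing)
print(round(idf, 2))  # 1.79, 1.10, 0.00

# tf-idf is the product of term frequency and idf; tf is an invented value.
tf <- 0.005
tf_idf <- tf * idf
print(round(tf_idf, 5))
```

With six documents, the idf can therefore only take a handful of distinct values, and any term used by all six parties gets a tf-idf of exactly 0 regardless of how frequent it is.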

Terms with low tf-idf:

tokens.count %>%
  arrange(tf_idf) 
## # A tibble: 55,124 x 6
##    party     word                n      tf   idf tf_idf
##    <chr>     <chr>           <int>   <dbl> <dbl>  <dbl>
##  1 DIE LINKE bundesregierung   566 0.0109      0      0
##  2 AfD       deutschland       470 0.0128      0      0
##  3 DIE LINKE deutschland       330 0.00637     0      0
##  4 DIE LINKE eu                324 0.00625     0      0
##  5 DIE LINKE menschen          286 0.00552     0      0
##  6 AfD       deutschen         235 0.00641     0      0
##  7 FDP       deutschland       226 0.0109      0      0
##  8 DIE LINKE endlich           219 0.00423     0      0
##  9 AfD       eu                208 0.00567     0      0
## 10 DIE LINKE vorsitzend        208 0.00401     0      0
## # ... with 55,114 more rows

An idf of 0 (and thus a tf-idf of 0) indicates that these terms appear in the press releases of all six parties.

In general, the inverse document frequency (and thus tf-idf) is very low for terms that occur in many of the documents (here: all press releases of one party) in a collection (here: the press releases of all parties), and it is exactly 0 for terms that occur in all of them.

Terms with high tf-idf:

tokens.count %>%
  arrange(desc(tf_idf))
## # A tibble: 55,124 x 6
##    party     word                  n      tf   idf  tf_idf
##    <chr>     <chr>             <int>   <dbl> <dbl>   <dbl>
##  1 AfD       weidel              185 0.00505  1.79 0.00904
##  2 FDP       beer                 59 0.00286  1.79 0.00512
##  3 FDP       nicola               58 0.00281  1.79 0.00503
##  4 AfD       pazderski            98 0.00267  1.79 0.00479
##  5 AfD       alic                145 0.00395  1.10 0.00434
##  6 FDP       lambsdorff           42 0.00203  1.79 0.00364
##  7 DIE LINKE dagdelen            102 0.00197  1.79 0.00353
##  8 FDP       präsidiumsmitgli     57 0.00276  1.10 0.00303
##  9 AfD       brandner             62 0.00169  1.79 0.00303
## 10 FDP       generalsekretärin    54 0.00261  1.10 0.00287
## # ... with 55,114 more rows

tokens.count %>% 
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>% 
  group_by(party) %>% 
  top_n(15, tf_idf) %>% 
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = party)) +
  geom_col(show.legend = FALSE, fill = "darkslategray4", alpha = 0.9) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~party, ncol = 3, scales = "free") +
  coord_flip()

#ggsave("../figs/tf-idf.png", width = 11, height = 6)

N-Grams

Words can be considered not only as single units, but also in their relationship to each other. N-grams, for example, help to investigate which words tend to follow others immediately. To do this, we tokenize the text into consecutive sequences of n words, called n-grams. By counting how often word X is followed by word Y, we can then build a model of the relationships between them.
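The idea can be illustrated with a minimal base R sketch on a single short snippet in the style of the cleaned text. This is a hand-rolled stand-in for illustration only, not how `unnest_tokens()` is implemented: it simply pairs each word with its successor.

```r
# A short cleaned snippet (lower-cased, stop words removed).
text <- "deutsche rentenversicherung warnt groko sondierer"

# Split into words on whitespace.
words <- strsplit(text, "\\s+")[[1]]

# Pair each word with its successor to form bigrams.
bigrams <- paste(head(words, -1), tail(words, -1))
print(bigrams)

# Three consecutive words form a trigram.
trigrams <- paste(head(words, -2), words[2:(length(words) - 1)], tail(words, -2))
print(trigrams)
```

Five words yield four overlapping bigrams and three overlapping trigrams. Counting such sequences per party is what the code in the following subsections does for the whole corpus.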

Bigrams

bigrams <- pressReleases %>% unnest_tokens(bigram, text_cleaned1, token="ngrams", n=2)

bigrams.count <- bigrams %>%
  count(party, bigram, sort = TRUE) %>%
  ungroup() %>%
  bind_tf_idf(bigram, party, n)

# Plot the 15 bigrams with the highest tf-idf per party
bigrams.count %>% 
  group_by(party) %>% 
  top_n(15, tf_idf) %>%   # rank by tf_idf explicitly instead of by the last column
  arrange(desc(tf_idf)) %>%
  ungroup() %>%
  ggplot(aes(reorder(bigram, tf_idf), tf_idf, fill = party)) +
  geom_col(show.legend = FALSE, fill = "darkslategray4", alpha = 0.9) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~party, ncol = 3, scales = "free") +
  coord_flip()

#ggsave("../figs/tf-idf_bigram.png", width = 11, height = 6)

Trigrams

trigrams <- pressReleases %>% unnest_tokens(trigram, text_cleaned1, token="ngrams", n=3)

trigrams.count <- trigrams %>%
  count(party, trigram, sort = TRUE) %>%
  ungroup() %>%
  bind_tf_idf(trigram, party, n)

# Plot the 15 trigrams with the highest tf-idf per party
trigrams.count %>% 
  group_by(party) %>% 
  top_n(15, tf_idf) %>%   # rank by tf_idf explicitly instead of by the last column
  arrange(desc(tf_idf)) %>%
  ungroup() %>%
  ggplot(aes(reorder(trigram, tf_idf), tf_idf, fill = party)) +
  geom_col(show.legend = FALSE, fill = "darkslategray4", alpha = 0.9) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~party, ncol = 3, scales = "free") +
  coord_flip()

#ggsave("../figs/tf-idf_trigram.png", width = 12, height = 6)